I am feeling excited. I expect to learn basically R in this course. I received an email about this course from university of Helsinki. Here is my forked repository:
GitHub repository : https://github.com/chamalee/IODS-project
This week I have covered data wrangling and linear regression analysis.
date()
## [1] "Sun Dec 12 17:26:01 2021"
learning2014 <- read.csv(file = "data/learning2014.csv", header = TRUE)
str(learning2014)
## 'data.frame': 166 obs. of 8 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(learning2014)
## [1] 166 8
This dataset contains data of 183 participants from a course which was collected in 2014. This dataset has 7 columns as follows.
Background of measures:
Measures A and C are based on parts A and C in ASSIST (Approaches and Study Skills Inventory for Students) http://www.etl.tla.ed.ac.uk/publications.html#measurement and D is based on SATS (Survey of Attitudes Toward Statistics) http://www.evaluationandstatistics.com/
The items of the central measure in this study (ASSIST B) are named so that the connections to the corresponding dimensions (Deep/SUrface/STrategic) can be easily seen.
# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
From above graphical overview, we can see the relationships between the variables. It shows the correlations between different variables and distributions of each variables with respect to ‘gender’ variable.
The colour of plots is defined with the ‘gender’ variable. Pink color plots correspond to Female and green color for Male.
In this graphical overview, we can see the correlations between all variables. We have a negative correlation between points and age. The trend is that when age increases points decreases. And there is a positive correlation between points are attitudes. The trend is that when attitude increases point increases. If we consider correlation between surf variable and deep variable, there is a strong correlation in males where as a weak correlation in females. However, we can also observe that there are approximately twice number of more female than male.
# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# fit a linear model # create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + deep, data = learning2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + deep, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## stra 0.9621 0.5367 1.793 0.07489 .
## deep -0.7492 0.7507 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
Linear model: formula = points ~ attitude + stra + deep
The best model is found by minimizing the prediction errors (residuals) that model would make. Goal is to minimize the sum of squared residuals. The parameters of residuals are as follows.
Parameters of Residuals:
Min -17.5239, 1Q -3.4276, Median 0.5474, 3Q 3.8220, Max 11.5112.
Coefficients give the estimates of the model parameters corresponding to each variable along with the intercept. Intercept corresponds to the value in y-axis where linear model intersects y-axis.
Coefficients:
(Intercept) 11.3915
attitude 3.5254
stra 0.9621
deep -0.7492
P-values:
The P-value which corresponds to attitude is very small. Therefore, we can conclude there is a strong correlation between the points variable and attitude variable. Due to high p-value of deep variable, there is the least correlation between points variable and deep variable. P-values are as follows.
attitude 4.44e-09 *** stra 0.07489 .
deep 0.31974
We can even observe standard error values of parameter estimates from Std. Error column in the summary - table of coefficients.
Here, it was considered, points as the target variable and attitude, stra and deep as the explanatory variables. It implies we are trying to estimate points variable as a multiple linear model of attitude, stra and deep variables. We can see that there is a statistically significant relationship between exam points and attitude, but not with deep. This means that deep does not interpret anything about exam points, but the attitude do interpret. Therefore, deep variable was removed from the consideration and remodelled only with attitude and stra variables.
With attitude + stra + deep variables -> Multiple R-squared: 0.2097 -> approximately 20% With attitude + stra variables -> Multiple R-squared: 0.2048 -> approximately 20%
Multiple R-squared implies the multiple correlation coefficient between three or more variables. It implies how strong the linear relationship is. In other words, how well the regression model fits the observed data. This value ranges from 0 to 1. For example, an R-squared of 20% reveals that 20% of the data fit the regression model. Generally, a higher R-squared indicates a better fit for the model. In this case, we can see after we remove deep variable, Multiple R-squared approximately remains unchanged. That is because deep variable does not have a high impact towards point variable.
# a scatter plot of points versus attitude
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = learning2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(my_model2, which = c(1,2,5))
Model assumptions in linear regression are as follows.
Analyzing the residuals of the model provides a method to explore the validity of model assumptions.
Assumptions related to errors are as follows.
Residuals vs Fitted-plot provides a method to explore the assumption that the size of a given error does not depend on the explanatory variables. Any pattern will imply a problem in the assumption. As we can see in above Residuals vs Fitted-plot, there is a reasonable random spread of points. We can not observe a pattern; but a constant variance. It confirms the assumption that the size of the errors does not depend on the explanatory variables.
Normal QQ-plot provides a method to explore the assumption that errors of the model are normally distributed. As we can see in above QQ-plot, there is a reasonable fit in the middle. But in the corners, there is a clear deviation which makes the normality assumption being questionable.
The Residuals vs leverage shows how much impact single point has on the model. It helps to identify which points have an unusually high impact. As we can see in above Residuals vs leverage plot, there are no highly unusual outliers.
This week I have covered data wrangling and logistic regression analysis.
date()
## [1] "Sun Dec 12 17:26:11 2021"
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# read the data into memory
alc <- read.table("https://github.com/rsund/IODS-project/raw/master/data/alc.csv", sep=",", header=TRUE)
dim(alc)
## [1] 370 51
glimpse(alc)
## Rows: 370
## Columns: 51
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F",…
## $ age <int> 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,…
## $ address <chr> "R", "R", "R", "R", "R", "R", "R", "R", "R", "U", "U", "U",…
## $ famsize <chr> "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "GT3", "LE3", "LE…
## $ Pstatus <chr> "T", "T", "T", "T", "T", "T", "T", "T", "T", "A", "A", "T",…
## $ Medu <int> 1, 1, 2, 2, 3, 3, 3, 2, 3, 3, 4, 1, 1, 1, 1, 1, 2, 2, 2, 3,…
## $ Fedu <int> 1, 1, 2, 4, 3, 4, 4, 2, 1, 3, 3, 1, 1, 1, 2, 2, 1, 2, 3, 2,…
## $ Mjob <chr> "at_home", "other", "at_home", "services", "services", "ser…
## $ Fjob <chr> "other", "other", "other", "health", "services", "health", …
## $ reason <chr> "home", "reputation", "reputation", "course", "reputation",…
## $ guardian <chr> "mother", "mother", "mother", "mother", "other", "mother", …
## $ traveltime <int> 2, 1, 1, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 1, 1, 1, 3, 1, 2, 1,…
## $ studytime <int> 4, 2, 1, 3, 3, 3, 3, 2, 4, 4, 2, 1, 2, 2, 2, 2, 3, 4, 1, 2,…
## $ schoolsup <chr> "yes", "yes", "yes", "yes", "no", "yes", "no", "yes", "no",…
## $ famsup <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ activities <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "no", "no", …
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "no"…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet <chr> "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes…
## $ romantic <chr> "no", "yes", "no", "no", "yes", "no", "yes", "no", "no", "n…
## $ famrel <int> 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 5, 3, 3,…
## $ freetime <int> 1, 3, 3, 3, 2, 3, 2, 1, 4, 3, 3, 3, 3, 4, 3, 2, 2, 1, 5, 3,…
## $ goout <int> 2, 4, 1, 2, 1, 2, 2, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 1, 2,…
## $ Dalc <int> 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ Walc <int> 1, 4, 1, 1, 3, 1, 2, 3, 3, 1, 1, 2, 3, 2, 1, 2, 1, 1, 1, 1,…
## $ health <int> 1, 5, 2, 5, 3, 5, 5, 4, 3, 4, 1, 4, 4, 5, 5, 1, 4, 3, 5, 3,…
## $ n <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ id.p <int> 1096, 1073, 1040, 1025, 1166, 1039, 1131, 1069, 1070, 1106,…
## $ id.m <int> 2096, 2073, 2040, 2025, 2153, 2039, 2131, 2069, 2070, 2106,…
## $ failures <int> 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2,…
## $ paid <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences <int> 3, 2, 8, 2, 5, 2, 0, 1, 9, 10, 0, 3, 2, 0, 4, 1, 2, 6, 2, 1…
## $ G1 <int> 10, 10, 14, 10, 12, 12, 11, 10, 16, 10, 14, 10, 11, 10, 12,…
## $ G2 <int> 12, 8, 13, 10, 12, 12, 6, 10, 16, 10, 14, 6, 11, 12, 12, 14…
## $ G3 <int> 12, 8, 12, 9, 12, 12, 6, 10, 16, 10, 15, 6, 11, 12, 12, 14,…
## $ failures.p <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ paid.p <chr> "yes", "no", "no", "no", "yes", "no", "no", "no", "no", "no…
## $ absences.p <int> 4, 2, 8, 2, 2, 2, 0, 0, 6, 10, 0, 6, 2, 0, 6, 0, 0, 4, 4, 2…
## $ G1.p <int> 13, 13, 14, 10, 13, 11, 10, 11, 15, 10, 15, 11, 13, 12, 13,…
## $ G2.p <int> 13, 11, 13, 11, 13, 12, 11, 10, 15, 10, 14, 12, 12, 12, 12,…
## $ G3.p <int> 13, 11, 12, 10, 13, 12, 12, 11, 15, 10, 15, 13, 12, 12, 13,…
## $ failures.m <int> 1, 2, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,…
## $ paid.m <chr> "yes", "no", "yes", "yes", "yes", "yes", "no", "yes", "no",…
## $ absences.m <int> 2, 2, 8, 2, 8, 2, 0, 2, 12, 10, 0, 0, 2, 0, 2, 2, 4, 8, 0, …
## $ G1.m <int> 7, 8, 14, 10, 10, 12, 12, 8, 16, 10, 14, 8, 9, 8, 10, 16, 1…
## $ G2.m <int> 10, 6, 13, 9, 10, 12, 0, 9, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ G3.m <int> 10, 5, 13, 8, 10, 11, 0, 8, 16, 11, 15, 0, 10, 11, 11, 15, …
## $ alc_use <dbl> 1.0, 3.0, 1.0, 1.0, 2.5, 1.0, 2.0, 2.0, 2.5, 1.0, 1.0, 1.5,…
## $ high_use <lgl> FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,…
## $ cid <int> 3001, 3002, 3003, 3004, 3005, 3006, 3007, 3008, 3009, 3010,…
# summary(alc)
This dataset contains data of 370 participants in secondary education of two Portuguese schools. This dataset has 51 columns. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. This combined datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008].
The purpose of the analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data. 4 interesting variables : - sex : I think female student might consume less amount of alcohol with compared to male students. - age : I think higher the age, the highest consumption. - romantic : I expect that having a committed relationship will make the least alcohol consumption. - freetime : I expect that more free time will allow the students for high alcohol consumption. In my point of view, high_use variable will behave approximately similar to alc_use.
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(dplyr)
alc_questions <- c("alc_use", "high_use", "sex", "age", "romantic", "freetime")
# alc_cols %>%
# select(alc,one_of(alc_questions))
alc_cols <- alc[,alc_questions]
# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(alc_cols, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
p <- ggpairs(alc_cols, mapping = aes(col = high_use, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
We can Immediately notice that,
#
# # access the 'tidyverse' packages dplyr and ggplot2
# library(dplyr); library(ggplot2)
#
# # define a new column alc_use by combining weekday and weekend alcohol use
# alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
#
# # initialize a plot of alcohol use
# g1 <- ggplot(data = alc, aes(x = alc_use, fill = sex))
#
# # define the plot as a bar plot and draw it
# g1 + geom_bar()
#
# # define a new logical column 'high_use'
# alc <- mutate(alc, high_use = alc_use > 2)
#
# # initialize a plot of 'high_use'
# g2 <- ggplot(alc, aes(high_use))
#
# # draw a bar plot of high_use by sex
# g2 + facet_wrap("sex") + geom_bar()
Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable.
model <- glm(high_use ~ age + freetime,
data = alc,
family = "binomial")
summary(model)
##
## Call:
## glm(formula = high_use ~ age + freetime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3711 -0.8959 -0.7071 1.3148 1.9878
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.30605 1.68463 -3.150 0.00163 **
## age 0.19413 0.09738 1.994 0.04620 *
## freetime 0.37360 0.12149 3.075 0.00210 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 438.07 on 367 degrees of freedom
## AIC: 444.07
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model)
## (Intercept) age freetime
## -5.3060538 0.1941321 0.3735975
What we can conclude from these coefficients is that:
Next, we will add romantic to the model.
model2 <- glm(high_use ~ age + freetime + romantic,
data = alc,
family = "binomial")
summary(model2)
##
## Call:
## glm(formula = high_use ~ age + freetime + romantic, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4318 -0.8620 -0.7207 1.3094 1.9994
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.47316 1.69354 -3.232 0.00123 **
## age 0.20842 0.09841 2.118 0.03418 *
## freetime 0.37709 0.12165 3.100 0.00194 **
## romanticyes -0.26045 0.25381 -1.026 0.30482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 437.00 on 366 degrees of freedom
## AIC: 445
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model2)
## (Intercept) age freetime romanticyes
## -5.4731577 0.2084192 0.3770938 -0.2604451
Very high p-value of romantic indicates that it is the least significant to high_use. We can conclude here that my previous hypothesis is not fully accurate since romantic does not show a strong correlation with high_use.
Next, we will add sex to the model. It will create my previously stated hypothesis model.
model3 <- glm(high_use ~ age + freetime + romantic + sex,
data = alc,
family = "binomial")
summary(model3)
##
## Call:
## glm(formula = high_use ~ age + freetime + romantic + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5445 -0.8506 -0.6576 1.2234 2.1190
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.74785 1.72535 -3.331 0.000864 ***
## age 0.21575 0.09994 2.159 0.030870 *
## freetime 0.29151 0.12479 2.336 0.019491 *
## romanticyes -0.20455 0.25873 -0.791 0.429182
## sexM 0.80662 0.24192 3.334 0.000855 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 425.66 on 365 degrees of freedom
## AIC: 435.66
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model3)
## (Intercept) age freetime romanticyes sexM
## -5.7478538 0.2157461 0.2915068 -0.2045533 0.8066158
Very low p-value of sex indicates that it is the most significant to high_use.
We can finalize the model as follows with the most significant variables.
model4 <- glm(high_use ~ age + freetime + sex,
data = alc,
family = "binomial")
summary(model4)
##
## Call:
## glm(formula = high_use ~ age + freetime + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4981 -0.8659 -0.6393 1.2428 2.0520
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.62126 1.71834 -3.271 0.001070 **
## age 0.20483 0.09903 2.068 0.038605 *
## freetime 0.28665 0.12457 2.301 0.021383 *
## sexM 0.81972 0.24134 3.396 0.000683 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 426.29 on 366 degrees of freedom
## AIC: 434.29
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(model4)
## (Intercept) age freetime sexM
## -5.6212569 0.2048304 0.2866508 0.8197204
let’s see the odds ratios with their CI’s.
OR <- model4 %>%
coef %>%
exp
CI <- model4 %>%
confint %>%
exp
## Waiting for profiling to be done...
m_OR <- cbind(OR, CI)
m_OR
## OR 2.5 % 97.5 %
## (Intercept) 0.003620088 0.0001161896 0.09987401
## age 1.227316904 1.0125010025 1.49443547
## freetime 1.331958997 1.0464192521 1.70714861
## sexM 2.269865003 1.4192715886 3.66181832
What we can conclude from these coefficients is that:
2X2 cross tabulation can give us an idea of the predictive power of the model.
probs <- predict(model4, type = "response") # Predict probability responses based on the variables
a_predict <- alc %>%
mutate(probability = probs) %>% # add them to our data.frame
mutate(prediction = probability > 0.5) # Compare those probabilities to our significance cutoff to identify the logical value
# Plot the 2X2 cross tabulation of the predictions
table(high_use = a_predict$high_use,
prediction = a_predict$prediction) %>%
prop.table %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66486486 0.03513514 0.70000000
## TRUE 0.25945946 0.04054054 0.30000000
## Sum 0.92432432 0.07567568 1.00000000
Approximately 70% of the values were predicted correctly.
Let’s compare the performance with some simple guessing model.
model5 <- glm(high_use ~ failures,
data = alc,
family = "binomial")
probs <- predict(model5, type = "response") # Predict probability responses based on the variables
a_predict <- alc %>%
mutate(probability = probs) %>% # add them to our data.frame
mutate(prediction = probability > 0.5) # Compare those probabilities to our significance cutoff to identify the logical value
# Plot the 2X2 cross tabulation of the predictions
table(high_use = a_predict$high_use,
prediction = a_predict$prediction) %>%
prop.table %>%
addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67567568 0.02432432 0.70000000
## TRUE 0.26756757 0.03243243 0.30000000
## Sum 0.94324324 0.05675676 1.00000000
In new model too, approximately 70% of the values were predicted correctly.
Cross-validation is a method of testing a predictive model on unseen data. In cross-validation, the value of a penalty (loss) function (mean prediction error) is computed on data not used for finding the model. Low value = good.
Cross-validation gives a good estimate of the actual predictive power of the model. It can also be used to compare different models or classification methods.
# the logistic regression model m and dataset alc (with predictions) are available
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] NaN
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model4, K = 10)
This week I have covered clustering and classification and data wrangling for next week exercise.
date()
## [1] "Sun Dec 12 17:26:23 2021"
# access the MASS package
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
This dataset contains information of 506 houses in Boston suburbs. This dataset has 14 columns. The data attributes include features such as per capita crime rate by town, proportion of residential land zoned for lots over 25,000 sq.ft., social and Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)) and so on. You may see the all the features explained in brief.
crim :per capita crime rate by town. zn : proportion of residential land zoned for lots over 25,000 sq.ft. indus : proportion of non-retail business acres per town. chas :Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). nox : nitrogen oxides concentration (parts per 10 million). rm :average number of rooms per dwelling. age :proportion of owner-occupied units built prior to 1940. dis :weighted mean of distances to five Boston employment centres. rad :index of accessibility to radial highways. tax :full-value property-tax rate per $10,000. ptratio : pupil-teacher ratio by town. black :1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town. lstat : lower status of the population (percent). medv : median value of owner-occupied homes in $1000s.
# MASS, corrplot, tidyr and Boston dataset are available
library(tidyr)
library(corrplot)
## corrplot 0.92 loaded
# plot matrix of the variables
pairs(Boston)
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)
# access the GGally and ggplot2 libraries
library(GGally)
library(ggplot2)
library(dplyr)
# create a more advanced plot matrix with ggpairs() : draws all possible scatter plots from the columns of a data frame, resulting in a scatter plot matrix.
p <- ggpairs(Boston, mapping = aes( alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)),upper = list(continuous = wrap("cor", size = 2)))
# draw the plot
p
We can Immediately notice that,
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
As we can observe above from the summary of the variables, we can notice that the distribution of values of each column variables have been scaled such that mean of the column data is zero and standard deviation is 1. Max and min values corresponding to a particular column too have adjusted accordingly.
We can create a categorical variable from a continuous one. There are many ways to to do that. Let’s choose the variable crim (per capita crime rate by town) to be our factor variable. We want to cut the variable by quantiles to get the high, low and middle rates of crime into their own categories.
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Divide and conquer: train and test sets When we want to use a statistical method to predict something, it is important to have data to test how well the predictions fit. Splitting the original data to test and train sets allows us to check how well our model works.
The training of the model is done with the train set and prediction on new data is done with the test set. This way you have true classes / labels for the test data, and you can calculate how well the model performed in prediction.
Time to split our data!
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
Linear Discriminant analysis is a classification (and dimension reduction) method. It finds the (linear) combination of the variables that separate the target variable classes. The target can be binary or multiclass variable. Linear discriminant analysis is closely related to many other methods, such as principal component analysis.
We are going to fit a linear discriminant analysis using the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2599010 0.2450495 0.2475248 0.2475248
##
## Group means:
## zn indus chas nox rm age
## low 0.95894437 -0.8796740 -0.122344297 -0.8641363 0.48552337 -0.8787176
## med_low -0.09744804 -0.3161204 0.006051757 -0.5474173 -0.10694222 -0.2734076
## med_high -0.38261997 0.1831570 0.200122961 0.3686210 0.09095368 0.3401594
## high -0.48724019 1.0171519 0.003267949 1.0541715 -0.50528854 0.7993609
## dis rad tax ptratio black lstat
## low 0.8208285 -0.6909262 -0.7454004 -0.41145224 0.38323406 -0.77953149
## med_low 0.3201515 -0.5491659 -0.5030501 -0.04897982 0.33012366 -0.13624896
## med_high -0.3628665 -0.4248647 -0.3228793 -0.30418022 0.06329128 -0.01581883
## high -0.8632369 1.6377820 1.5138081 0.78037363 -0.78933500 0.91085012
## medv
## low 0.56193189
## med_low 0.03652025
## med_high 0.16289841
## high -0.62898128
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10048185 0.84709912 -0.86771216
## indus 0.01430006 -0.20400892 0.13255228
## chas -0.08423338 -0.06145215 0.13615290
## nox 0.40627266 -0.67896625 -1.43279373
## rm -0.14411254 -0.07284770 -0.24080921
## age 0.20215813 -0.26967929 0.11263569
## dis -0.06795632 -0.36946615 0.29528597
## rad 3.34871981 1.04712118 -0.02047689
## tax 0.09928549 -0.21586004 0.58922078
## ptratio 0.11818159 0.07397334 -0.20225905
## black -0.11139782 0.02154920 0.14308332
## lstat 0.27530756 -0.26523817 0.30670809
## medv 0.25956957 -0.34512462 -0.17688985
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9576 0.0319 0.0105
LDA can be visualized with a biplot.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
It can be observed that rad has the most influence over the separation of the model. Moreover, it implies that rad variable is more likely to influence the most on the crim variable.
Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 11 11 0 0
## med_low 6 18 3 0
## med_high 0 11 13 2
## high 0 0 0 27
We can observe that diagonals of the cross tabulated table (14,18,11,22) represent correctly classified crime categories where as off-diagonals represent the mis-classifications. Correct classifications will be interpreting low as low, med low as med low, med high as med high and high as high. All others are mis-classifications. Correct classifications = 14+18+11+22 = 65 Total mis-classifications = 5+16+1+2+3+10= 37
Reload the Boston dataset and standardize the dataset. Calculate the distances between the observations.
# load MASS and Boston
library(MASS)
data('Boston')
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# euclidean distance matrix
dist_eu <- dist(Boston)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.119 85.624 170.539 226.315 371.950 626.047
# manhattan distance matrix
dist_man <- dist(Boston, method = 'manhattan')
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.016 149.145 279.505 342.899 509.707 1198.265
K-means is maybe the most used and known clustering method. It is an unsupervised method, that assigns observations to groups or clusters based on similarity of the objects. In the previous exercise we got a hang of distances. The kmeans() function counts the distance matrix automatically, but it is good to know the basics. Let’s cluster a bit!
# k-means clustering
km <-kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
K-means needs the number of clusters as an argument. There are many ways to look at the optimal number of clusters and a good way might depend on the data you have.
One way to determine the number of clusters is to look at how the total of within cluster sum of squares (WCSS) behaves when the number of cluster changes. When you plot the number of clusters and the total WCSS, the optimal number of clusters is when the total WCSS drops radically.
K-means might produce different results every time, because it randomly assigns the initial cluster centers. The function set.seed() can be used to deal with that.
Steps are as follows.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
The optimal number of clusters is when the value of total WCSS changes radically. In this case, we observe that radical change from 1 to 2. Hence, two clusters would seem optimal.
Next, Run kmeans() again with two clusters and visualize the results.
# k-means clustering
km <-kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?
data("Boston")
boston_scaled <- Boston %>%
scale %>%
as.data.frame
set.seed(123)
# k-means clustering
km <- kmeans(boston_scaled, centers = 3)
# clusters
boston_b <- boston_scaled %>%
mutate(crim = as.factor(km$cluster))
# number of rows in the Boston dataset
n <- nrow(boston_b)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_b
# create test set
test <- boston_b
correct_classes <- test$crim
test <- dplyr::select(test, -crim)
# perform LDA using the clusters as target classes
lda.fit <- lda(crim ~ ., data = train)
# target classes as numeric
classes <- as.numeric(train$crim)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)
It can be observed that rad has the most influence for the separation of the model.
Moreover, it implies that rad variable is influencing the most into cluster 1.
Next, age variable is the second most influential variable for the separation of the model.
age variable is highly likely to separate cluster 2 and 3 with respect to LD 2 axis.
This week I have covered Dimensionality reduction techniques and data wrangling exercise.
date()
## [1] "Sun Dec 12 17:26:55 2021"
# read the human data from URL
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep =",", header = T)
# human <- read.table("data/human.csv", sep = ";")
dim(human)
## [1] 155 8
# look at the (column) names of human
names(human)
## [1] "Edu2.FM" "Labo.FM" "Edu.Exp" "Life.Exp" "GNI" "Mat.Mor"
## [7] "Ado.Birth" "Parli.F"
The origin of the data is the United Nations Development Programme. This dataset contains information of 155 countries. This dataset has 8 columns. The data attributes include features such as Gross National Income, Life expectancy at birth, Expected years of schooling and so on.
You may see all features explained in brief.
# look at the structure of human
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# print out summaries of the variables
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# Access GGally
library(GGally)
library(corrplot)
# visualize the 'human' variables
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot
corrplot.mixed(cor(human, use = "pairwise"), order ="hclust")
We can Immediately notice that,
Principal Component Analysis (PCA) can be performed by two sightly different matrix decomposition methods from linear algebra: the Eigenvalue Decomposition and the Singular Value Decomposition (SVD).
There are two functions in the default package distribution of R that can be used to perform PCA: princomp() and prcomp(). The prcomp() function uses the SVD and is the preferred, more numerically accurate method.
Both methods quite literally decompose a data matrix into a product of smaller matrices, which let’s us extract the underlying principal components. This makes it possible to approximate a lower dimensional representation of the data by choosing only a few principal components.
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
# print out summaries of the non-standardized variables
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
A biplot is a way of visualizing the connections between two representations of the same data. First, a simple scatter plot is drawn where the observations are represented by two principal components (PC’s). Then, arrows are drawn to visualize the connections between the original variables and the PC’s.
The following connections hold:
The angle between the arrows can be interpret as the correlation between the variables. The angle between a variable and a PC axis can be interpret as the correlation between the two. The length of the arrows are proportional to the standard deviations of the variables.
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
We can Immediately notice that results with and without standardizing are different.
After standardization, each column variables have been scaled such that mean of the column data is zero and standard deviation is 1.
In bi-plot, the length of the arrow is proportional to standard deviation of features/variables.
Since data is standardized, we can observe almost similar lengths in all arrows. It concludes that all variables have equal standard deviations.
In bi-plot, Small angle implies high positive correlation.
We can see that Labo.FM variable is almost orthogonal to Edu.Exp variable. We do not see a strong correlation between Edu.Exp variable and Labo.FM. That is because Expected years of schooling does not depend on Proportion of females in the labour force / Proportion of males in the labour force variable. These two variables can be considered as almost independent.
It can be observed that there is a small angle between Mat.Mor and Ado.Birth. They are pointing to the same direction. That is due to the fact that Ado.Birth has a strong positive correlation with Mat.Mor. Maternal mortality ratio increases with Adolescent birth rate.
On the other hand, Mat.Mor and Ado.Birth almost align with first principle component. It implies that they contribute to that dimension.
The Factominer package contains functions dedicated to multivariate explanatory data analysis. It contains for example methods (Multiple) Correspondence analysis , Multiple Factor analysis as well as PCA.
In the next exercises we are going to use the tea dataset. The dataset contains the answers of a questionnaire on tea consumption.
library(FactoMineR)
# load the data set from FactoMineR-package
data("tea")
# the structure and the dimensions of the data and visualize
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming exciting
## feminine :129 Not.sophisticated: 85 No.slimming:255 exciting :116
## Not.feminine:171 sophisticated :215 slimming : 45 No.exciting:184
##
##
##
##
##
## relaxing effect.on.health
## No.relaxing:113 effect on health : 66
## relaxing :187 No.effect on health:234
##
##
##
##
##
library(tidyr)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
Multiple Correspondence Analysis (MCA) is a method to analyze qualitative data and it is an extension of Correspondence analysis (CA). MCA can be used to detect patterns or structure in the data as well as in dimension reduction.
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.279 0.261 0.219 0.189 0.177 0.156 0.144
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519 7.841
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953 77.794
## Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.141 0.117 0.087 0.062
## % of var. 7.705 6.392 4.724 3.385
## Cumulative % of var. 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139 0.003
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626 0.027
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111 0.107
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841 0.127
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979 0.035
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990 0.020
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347 0.102
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459 0.161
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968 0.478
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898 0.141
## v.test Dim.3 ctr cos2 v.test
## black 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 2.867 | 0.433 9.160 0.338 10.053 |
## green -5.669 | -0.108 0.098 0.001 -0.659 |
## alone -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 3.226 | 1.329 14.771 0.218 8.081 |
## milk 2.422 | 0.013 0.003 0.000 0.116 |
## other 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
In the above MCA factor map, the variables are drawn only for the first two dimensions.
The plot above helps to identify categories that are the most correlated with each dimension. The squared correlations between categories and the dimensions are used as coordinates.
It can be seen that, the category variables alone and lunch are the most correlated with dimension 1. Similarly, the variables sugar and No.Sugar are the most correlated with dimension 2.
Moreover, it is more likely to have unpackaged tea in a tea shop while tea bag tea in chain store. These two pairs can be observed together in the plot.
# draw a screeplot of the MCA
library("factoextra")
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))
# draw a biplot of the MCA and the original variables
mca <- MCA(tea_time, graph = FALSE)
library("factoextra")
# eig.val <- get_eigenvalue(mca)
# fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))
fviz_mca_biplot(mca,
repel = TRUE, # Avoid text overlapping (slow if many point)
ggtheme = theme_minimal())
## Warning: ggrepel: 190 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
The plot above helps to identify variables that are the most correlated with each dimension. The squared correlations between variables and the dimensions are used as coordinates.
Active individuals are in blue in the plot.
Active variable categories are in red in the plot.
Individuals such as 171,277 and 278 are strongly correlated with dim 1 where as 72,270 and 15 are strongly correlated with dim 2.
Reference:
Datacamp : https://campus.datacamp.com/courses/helsinki-open-data-science/dimensionality-reduction-techniques
This week I have covered analysis of longitudinal data and data wrangling exercise.
date()
## [1] "Sun Dec 12 17:27:07 2021"
Many studies in the behavioral sciences involve several measurement or observations of the response variable of interest on each subject in the study.
For example, the response variable may be measured under a number of different occasions over time.
Such data are called longitudinal data.
RATS data is extracted from a nutrition study conducted in three groups of rats (Crowder and Hand, 1990).
The three groups were put on different diets, and each animal’s body weight (grams) was recorded repeatedly (approximately weekly, except in week seven when two recordings were taken) over a 9-week period.
The focus here is to check whether the growth profiles of the three groups differ.
In this BPRS dataset, 40 male subjects were randomly assigned to one of two treatment groups and each subject was rated on the brief psychiatric rating scale (BPRS) measured before treatment began (week 0) and then at weekly intervals for eight weeks.
The BPRS assesses the level of 18 symptom constructs such as hostility, suspiciousness, hallucinations and grandiosity; each of these is rated from one (not present) to seven (extremely severe).
The scale is used to evaluate patients suspected of having schizophrenia.
date()
## [1] "Sun Dec 12 17:27:07 2021"
# packages in use
library(dplyr)
library(ggplot2)
library(tidyr)
bprs = read.csv("data/BPRSL.csv", as.is = TRUE, header = TRUE, row.names = NULL)
rats = read.csv("data/RATSL.csv", as.is = TRUE, header = TRUE, row.names = NULL)
str(bprs)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
str(rats)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
We should factor the factorial variables again.
# Factor variables ID and Group
rats$ID <- factor(rats$ID)
rats$Group <- factor(rats$Group)
str(rats)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
# # Factor treatment & subject
# bprs$treatment <- factor(bprs$treatment)
# bprs$subject <- factor(bprs$subject)
# str(bprs)
The data is based on a study conducted in three groups of rats and observed weights over time. Let’s check the group-wise weight differences of rats.
ggplot(rats, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~Group, labeller = label_both) +
scale_y_continuous(limits = c(min(rats$Weight), max(rats$Weight)), name = "Non-standardized weights")
We can Immediately notice that,
Group-1 rats have gained lower weights with compared to Group-2 and Group-3 .
The rats in Group-2 and Group-3 achieved almost similar weights.
# Standardize
rats <- rats %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight))/sd(Weight) ) %>%
ungroup()
# Plot with the standardized data
ggplot(rats, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "Standardized weights")
Even after standardization of weights, we can Immediately notice that,
Group-1 rats have gained lower weights with compared to Group-2 and Group-3 .
The rats in Group-2 and Group-3 achieved almost similar weights.
The summary measure method operates by transforming the repeated measurements made on each individual in the study into a single value that captures some essential feature of the individual’s response over time.
The key step to a successful summary measure analysis of longitudinal data is the choice of a relevant summary measure.
rats_l <- rats %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight) / sqrt(n())) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
# Plot the mean profiles
ggplot(rats_l , aes(x = Time, y = mean, linetype = Group, shape = Group, col = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.8,0.4)) +
scale_y_continuous(name = "Mean (weight) +/- se (weight)")
We can Immediately notice that,
Mean weights of Group-1 rats are again lower with compared to Group-2 and Group-3 .
The rats in Group-3 always achieved the highest mean weights over the entire observation period.
The rats in Group-2 achieved almost similar mean weights to Group-3, but lesser than Group-3.
ggplot(rats, aes(x = as.factor(Time), y = Weight, fill = Group)) +
geom_boxplot(position = position_dodge(width = 0.9)) +
theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
theme(legend.position = c(0.8,0.4)) +
scale_x_discrete(name = "Time")
The boxplots of the Group-1 and Group-3 reveal more outliers with compared to Group-2.
The diagram indicates that the weight is more variable in Group-2. That is because height of boxes are higher in Group-2 with compared to other groups.
rats_summary <- rats %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the `.groups` argument.
ggplot(rats_summary, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=22, size=3, fill = "white") +
scale_y_continuous(name = "Mean Weight")
The diagram indicates that the mean summary measure is more variable in Group-2 and its distribution in this group is somewhat skew.
The boxplots of all groups reveal outliers.
The distribution of Group-2 and Group-3 are more skewed where as the distribution in Group-1 is more balanced.
rats_out <- rats_summary %>%
filter((Group==1 & mean > 250)|(Group==2 & mean < 500)| (Group==3 & mean > 500))
ggplot(rats_out, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=22, size=3, fill = "white") +
scale_y_continuous(name = "Mean Weight")
After we have removed the outliers,
The diagram indicates that the height of all boxes has reduced indicating that variance of all groups goes down. Specially the varaince of Group-2 is now far reduced.
The distribution in Group-1 is now more skewed similarly to The distributions of Group-2 and Group-3.
# Fit the linear model
fit <- lm(mean ~ Group, data = rats_summary)
# summary of the fitted model
summary(fit)
##
## Call:
## lm(formula = mean ~ Group, data = rats_summary)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.900 -27.169 2.325 9.069 106.200
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 265.02 12.98 20.419 2.92e-11 ***
## Group2 222.77 22.48 9.909 2.00e-07 ***
## Group3 262.48 22.48 11.675 2.90e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.71 on 13 degrees of freedom
## Multiple R-squared: 0.9316, Adjusted R-squared: 0.9211
## F-statistic: 88.52 on 2 and 13 DF, p-value: 2.679e-08
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 238620 119310 88.525 2.679e-08 ***
## Residuals 13 17521 1348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The P-value which corresponds to both Group-2 and Group-3 is very small. Therefore, we can conclude there is a strong correlation between the Weight-mean variable and both Group-2 and Group-3. P-values are as follows. Group2 -> 2.00e-07 and Group3 -> 2.90e-08.
For the analysis of variance (ANOVA), To determine whether any of the differences between the means are statistically significant, we can compare the p-value to significance level to assess the null hypothesis. The null hypothesis states that the population means are all equal. Usually, a significance level (denoted as α or alpha) of 0.05 works well. A significance level of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
Here small p-value indicates that a difference exists between groups of rats. The risk here is very small.
# Factor treatment & subject
bprs$treatment <- factor(bprs$treatment)
bprs$subject <- factor(bprs$subject)
str(bprs)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
bprs %>%
ggplot(aes(x = week, y = bprs)) +
geom_line(aes(linetype = subject)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_wrap("treatment", labeller = label_both) +
theme_bw()
# create a regression model
bprs_lm <- lm(bprs ~ week + treatment, data = bprs)
# print out a summary of the model
summary(bprs_lm)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = bprs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
treatment-2 has a higher p-value which indicates independance w.r.t bprs variable.
The P-value which corresponds to both week is very small. Therefore, we can conclude there is a strong correlation between the week variable and bprs variable .
Linear mixed models allow for correlations between the repeated measurements by introducing random effects for subjects.
Here, we fit the data to a random intercept model with week and treatment as the variables which affects BPRS.
The random intercept model has two parts. It’s got a fixed part (which is the intercept and the coefficient of the explanatory variable x the explanatory variable) and it’s got a random part.
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
bprs_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = bprs, REML = FALSE)
# Print the summary of the model
summary(bprs_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
We can have the attention towards variability (standard deviation) in random-part first.
Standard deviation of Residual is higher than the standard deviation of subject.
This higher standard deviation illustrates a higher varaition in residuals.
Now we can move on to fit the random intercept and random slope model to the data.
Fitting a random intercept and random slope model allows the linear regression fits for each individual to differ in intercept but also in slope.
This way it is possible to account for the individual differences in the subjects bprs profiles, but also the effect of time.
# create a random intercept and random slope model
bprs_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = bprs, REML = FALSE)
# print a summary of the model
summary(bprs_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(bprs_ref1, bprs_ref)
## Data: bprs
## Models:
## bprs_ref: bprs ~ week + treatment + (1 | subject)
## bprs_ref1: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_ref 5 2748.7 2768.1 -1369.4 2738.7
## bprs_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s pay attention to the chi-squared statistics and p-value of the likelihood ratio test between bprs_ref1 and bprs_ref.
The lower the value the better the fit against the comparison model.
As we can see in above table, the lower chi-squared statistics which is 0.026 indicates that the random slope and intercept model fits our data better.
Finally, we can fit a random intercept and slope model that allows for a treatment Ă— week interaction.
# create a random intercept and random slope model
bprs_ref2 <-lmer(bprs ~ week * treatment + (week | subject), data = bprs, REML = FALSE)
# print a summary of the model
summary(bprs_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: bprs
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(bprs_ref2, bprs_ref1)
## Data: bprs
## Models:
## bprs_ref1: bprs ~ week + treatment + (week | subject)
## bprs_ref2: bprs ~ week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## bprs_ref1 7 2745.4 2772.6 -1365.7 2731.4
## bprs_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s again pay attention to the chi-squared statistics and p-value of the likelihood ratio test between bprs_ref1 and bprs_ref2.
The lower the value the better the fit against the comparison model.
As we can see in above table, the higher chi-squared statistics which is 0.074 indicates that the random slope and intercept model does NOT fit our data better.
# Create a vector of the fitted values
Fitted <- fitted(bprs_ref2)
# Create a new column fitted to BPRS
bprs <- bprs %>%
mutate(Fitted)
ggplot(bprs,aes(x = week, y = bprs)) +
geom_line(aes(linetype = subject)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_wrap("treatment", labeller = label_both) +
theme_bw() +
theme(legend.position = "none")
ggplot(bprs,aes(x = week, y = Fitted)) +
geom_line(aes(linetype = subject)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_wrap("treatment", labeller = label_both) +
theme_bw() +
theme(legend.position = "none")
That’s the end of course! Thanks to everyone!